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Abstract 

Motivation "Molecular signatures" or "gene-expression signatures" are used to predict patients' charac- 
teristics using data from coexpressed genes. Signatures can enhance understanding about biological mech- 
anisms and have diagnostic use. However, available methods to search for signatures fail to address key 
requirements of signatures, especially the discovery of sets of tightly coexpressed genes. 

Results After suggesting an operational definition of signature, we develop a method that fulfills these 
requirements, returning sets of tightly coexpressed genes with good predictive performance. This method 
can also identify when the data are inconsistent with the hypothesis of a few, stable, easily interpretable sets 
of coexpressed genes. Identification of molecular signatures in some widely used data sets is questionable 
under this simple model, which emphasizes the needed for further work on the operationahzation of the 
biological model and the assessment of the stability of putative signatures. 

Availability The code (R with C-|— 1-) is available from http : / /www . ligarto . org/rdiaz/Sof tware/Sof tweire . htmll 
under the GNU GPL. 

Contact rdiaz@cnio.es 

Supplementary information [http : //ligarto . org/rdiaz/Papers/signatures-supl .mat .pdf | 



1 Introduction 

"Molecular signatures" or "ge ne-expression s ignature s" are a key feature in many studies that use microarray 
data in cancer research (e.g.. lAlizadeh all .2000,: .Gohib el a,l\. Il999t IPomerov etoR 120021 iRosenwald el di 
|2002; Sh ipD et a?J . l2002|) . IShaffer et al\ l|200lL p. 375) refer to signatures as "(...) genes that are coordinately 
expressed in samples related by some identifiable criterion such as cell type, differentiation state, or signaling 
response" (emphasis is ours). Molecular signatures are often used to model patients' clinically relevant informa- 
tion (e.g., prognosis, survival time, etc) as a function of the gene expression data, but instead of using individual 
genes as predictors, the predictors are the signature components or "metagenes". 

If we are successful searching for a signature, then we will be able to model, for instance, the probability 
of developing a metastasis as a function of a few signature components or metagenes where each signature 
component is made of genes that show strong coexpression. Thus, molecular or gene expression signatures 
can be important both for diagnostic purposes and for providing information about the biological mechanisms 
underlying certain conditions by highlighting genes that both coexpress and are related to that condition. 

In spite of the widespread use of the term "molecula r sig i iature", no explicit definition is available . Fol- 
lowing the conve ntions of the literature fe.g.. lAlizadch ^M . l2000t iGohib et oil Il999t IPomerov 6^^. l2002t 
IRosenwald et !al],[2002; Sh ipp et QiJ . l2002HWest et oi.Ll200lh . and building upon the definition above jShaffer et al\. 
I2OOII p. 375), we will consider a signature to be composed of one or more signature components or meta- 
genes, where each signature component is a weighted combination of one or more coexpressed genes, and such 
that statistical models that use signatures both have good predictive performance and are easy to interpret 
biologically. Interpretation is made easier because the prediction is based on signature components that are 
weighted averages of subsets of tightly coexpressed genes, which can help when attempting to relate spe- 
cific biological features to, for example, particular alterations on a metabolic pathway. Based upon the above 
references, we can try to formalize these goals by requiring that signatures and signature components satisfy 
the conditions shown in Figure Q 



[Figure 1 about here] 



The conditions in Figure ^ reflect a very specific biological model. Our objective is to develop a statistical 
method appropriate for this biological model. By using a method that tries to fulfill those conditions we can also 
provide evidence that, for any particular case, the underlying biological assumptions behind this attempt are 
inconsistent with the data or, in other words, the the assumptions embodied in Figure H^^re inappropriate. As 
will be discussed later, our method is an attempt to map a particular biological model into a statistical method, 
but other statistical approaches would be more appropriate if more complex biological models are regarded as 
appropriate. 

1.1 Limitations of alternative methods 

A variety of approaches have been used to identify molecular signatures. A review is pro vided in th e supple- 
mentary material. Bri e fiy, most method s return either a single signature component fe.g.. iGolub et a l. 199!^ 
iHedenfalk et all 1200T, "va n Bell^ l2Q02l) which is a we i ghted a ,verage of a set of genes, or several sig nature 
components (e.g., .Antoniadis et all l2003t iHastie et all 12001 d iHuang et all 12003 iRamaswamv et all l2001t 
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West et all 120011) which are often obtained using dimension reduction techniques (e.g., principal component 
analysis [PCA] , partial least squares [PLS] , sufficient dimension reduction) cither on the complete set of genes 
or on a preselected subset. 

The most common problems of available methods are: 

• Genes within signature components do not necessarily show tight coexpression: no method makes tight 
coexpression a requirement to be fulfilled. 

• The interpretation of components is very difficult for most methods that use PCA or PLS, since all the 
genes to which PCA or PLS is applied have loadings on each component. 

• The search for components in many PCA or clustering of genes methods is carried out without incorpo- 
rating information from the dependent variable. 

• Most methods are designed for a specific type of task (e.g., classification or survival, but not both) and 
would be difficult to extend to other types of dependent variables. 

Our objective in this paper is to propose a method that overcomes these problems. Specifically, our method 
returns signature components of tight coexpression (and thus, signature components that should ease interpre- 
tation) with good predictive performance. Although in this paper we focus on class prediction, our method can 
be used with different types of dependent data (continuous, categorical, survival), and thus sets up a general 
framework for finding gene expression signatures regardless of the type of dependent variable. 

Based on the proposed operational definition of signatures (see figure ^ , we first discuss the key elements 
of our proposed method. Next, we evaluate its predictive performance and finally with discuss it relation with 
other methods and problems of biological interpretability. Further details about the algorithm, evaluation of 
recovery of signatures, and a longer review of alternative approaches are provided in the supplementary material. 

2 Methods 

2.1 Key elements of the proposed method 

Our objective is to directly fulfill the conditions in figure^ We start our search with a seed gene that will be the 
skeleton of a signature component; this first signature component is found so that genes within the component 
show tight coexpression and the prediction error is acceptable. We repeat this process (find seed gene for a 
component and then obtain the whole component) greedily, until no further components are needed. The main 
steps of the algorithm are shown in Figure |21 In this section we explain how the conditions in figure ^ can be 
fulfilled and provide a geometrical interpretation of the algorithm. 

[Figure 2 about here] 



2.1.1 Fulfilling signature requirements 

A common and simple way of characterizing a signature component is to use linear combinations (weighted av- 
erages) of the genes that belong to that s ignature component (e.g., Ramaswaniv et al, 2003,; Roscnwald et al^ 
l2002l:IShinn et nl\ . \2()0'i\We.f^t et r?i.Ll2nni(> . Although we could characterize a signature component usin g several 
different linear combinations of the genes of that component, most methods (but see iLiu et all l2f)f)^ charac- 
terize a signature component or metagene using only one linear combination. A single metagene per signature 
component simplifies interpretation, and is implicit in the requirement that each gene of a signature components 
should show a strong correlation with the signature component. 

Thus, to fulfill requirement one in figure^ we can use Principal Component Analysis (PCA — which is closely 
related to Singular Value Decomposition [SVD]). PCA yields "the best" representation (or "least distorting" 
repre sentation, in the least squares sense) of the original data (e.s.. I Jolliffd . l200^ iKrzanowskiL ll99St iMorrisonl 
199(1) in a subspace of reduced dimensions. The first PC is the best 1-dimensional representation of the original 
genes of this signature component. If the genes of the signature component are tightly coexpressed, then each 
of these genes should show a high correlation with the signature component, as we required above (this will also 
mean that the percentage of variance in the original gene expression data explained by this first PC will be high 
— see supplementary material) . After running the procedure, each signature component will be made of tightly 
coexpressed genes (we require that all the genes in a component show a correlation larger than a pre-specified 
threshold of rmin)- 
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In contrast to some previous methods which use PCA or PLS over the complete set of genes, there is no need 
for our method to return components which are uncorrelated or orthogonah there is no biological argument 
that requires that signature components be orthogonal, uncorrelated, or independent (see discussion). For case 
of interpretation we will additionally require that no gene belongs to more than one signature component. (In 
other words, each gene in the original data matrix belongs to either one, and only one, signature component or 
to none.) 

The second and third requirements of figure ^ can be incorporated by adding new signature components 
only if they result in a relevant reduction of prediction error, and retaining genes in a signature unless they 
produce large increases in prediction error. In other words, we will penalize adding signature components, but 
will try to obtain large signature components. The reason is that the search for molecular signatures is often 
pursued to provide biological insights into coexpressed genes related to conditions and, therefore, minimization 
of prediction error is not the only goal. Thus, if there are potential trade-offs between prediction error and 
"biologically intcrpretable signatures", the researcher should have the option of modifying the terms of this 
trade-off flexibly. 



2.1.2 Searching for signature components and a geometrical interpretation 

Our objective is, thus, to maximize predictive performance using signature components that satisfy that the 
correlation of each gene in a signature component with the signature component is larger than a given threshold. 
However, the discussion so far does not indicate how to find the signature components and, given the dimen- 
sionality of the problem, an exhaustive search for the optimal solution is not possible. Since we require that 
each component be highly correlated with the genes of that component, we can start the search with genes that 
have good predictive abilities on their own. Once we find an initial "seed gene", we build an initial candidate 
signature component by including all "promising genes" (e.g., all those with a minimum correlation with the seed 
gene), and later reduce the signature component eliminating genes until the conditions of minimum correlation 
with 1st PC (all genes have a correlation with the 1st PC > r„iin) and predictive performance are met. (If this 
elimination eliminates all genes except the seed gene, then, of course, the two requirements are met). 

The method proposed here is a heuristic search that has an intuitive geometrical interpretation. We require 
that each component be highly correlated (> rmin) with the genes in the component, which is equivalent to 
saying that the vector of the component must have a similar direction as the vectors of each gene in variable space 
(the space where subjects are the axes). Therefore, no matter which genes belong to a signature component, 
the component will have a similar direction as any of its genes. Then, it seems reasonable to start the search 
with the direction that has the best predictive ability, the seed gene; this seed gene is the single direction in 
space that most contributes to separation of the groups in a classification problem; analogous for regression or 
survival analysis. When we form the complete signature component, all other genes of the signature component 
have directions that are similar to that of the seed gene. Together, all the genes of a signature component move 
the direction slightly, but this shift is possibly towards directions that contribute more to separation of groups 
(or that at least do not degrade the separation too much) and never moves us far away from the original seed 
gene. This process is repeated until the addition of new signature components does not achieve any relevant 
decrease in prediction rate, or until a maximum pre-specified number of signature components is reached. The 
algorithm is shown in figure [21 Further details are given in the supplementary material. 



2.1.3 Choice of underlying classifier 

In this paper we will be dealing with a classification problem. Each signature component is used as a predictor 
variable for a classifier. Of the available classification methods, we have used DLDA (diagonal linear discriminant 
analysis), a versio n of linear discrimin ant analysis which assumes the same diagonal variance-covariance matrix 
for all the classes (|Dudoit et oil l20f)3l . and NN (k-nearest neighbor, with k = 1), a simple non-parametric rule 
that assigns a test sample to the class of the closest training sample (where closeness is measured using Euclidean 
distance in the space whose dimensions are the signature components). KNN and DLDA have b een repeatedly 
shown to perform as well a s, or better than, many competing methods with microarray data | Dudoi^_ei_a|J, 



l2002':'Ro mualdi et a/.l.l2003(l . In addition, DLDA and KNN are simple to implement and interpret. iDudoit et al\ 
p002 'l used an adaptive procedure to estimate the optimal number of neighbors to use with KNN; that can 
be tim e consuming, and we have fixed K = 1, since this is often a successful rule l|Hastie eA, a/.L l200ld: iRiDlevl 
I199(tI) . As discussed in the supplementary material, other classifiers can be used. 



3 Comparing predictive performance with established methods 

Here we compare the predictive performance of our method with that of three well established methods, support 
vector machines, KNN, and DLDA, using several "real data" sets. The supplementary material reports simulation 
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studies that show that the suggested method can indeed recover signatures when they are present in the data. 

Predictive performance is evaluated using 10- fold cross-validation (i.e., the complete algorithm shown in 
Figure [5] is applied to each of the 10 "training sets"). This 10- fold cross-validation was repeated 20 times under 
each condition. The error rates shown are not the CV error rates obtained in steps 1 and 5 of the signature 
algorithm (see Figure EJ, since those are biased down (see supplementary material); the error rates shown are 
the error rates obtained from cross-validating the complete procedure. 

Because an important parameter of our method might be rmin , the minimal absolute correlation between each 
gene in a signature component and the signature component, we have evaluated the performance of the signature 
method using a set of values of rmin that covers a "biologically interesting" range: {.60, .65, .70, .75, .80, .85, .90, .95} 
In addition, we have also examined the differences between using ci = C2 = 1 compared to ci = C2 = 0; the 
first corresponds to the usual "1 se rule" and should lead to more interpretable results (ci and C2 are related to 
how much we penalize adding a new signature component and how much we penalize eliminating genes from 
signature components; see details in supplementary material). 



3.1 The data sets 

Leukemia dataset From IColub et al\ l)l999|) . The original data, from an Affymetrix chip, comprises 6817 
genes, but after filtering as done by th e authors we are left with 3051 genes. Filtering and preprocessing is 
described in the original paper and in lPudoit et al\ ^Q^. We used the training data set of 38 cases (27 
ALL and 11 AML) in the original paper (the observations in the "test set" are from a different lab and were 
collected at different times). This data set is available from http : //www-genome .wi .mit . edu/ cgi-bin/ ceincer/datasets . 
and also from the Bioconductor package multtest ( http : //www .bioconductor . or^). 

Adenocarcinoma dataset From IRamaswamv et al\ (|2003fl . We used the data from the 12 metastatic tu- 
mors and 64 primary tumors. The original data set included 16063 genes from Affymetrix chips. The data 
(DatasetA_Tum_vsMet. res), downloaded from http://www-genome.wi.mit.edu/cgi-bin/cajicer/ , had 
already been rescaled by the authors. We took the subset of 9376 genes according to the UniGene mapping, 
thresholded the data, and filtered by variation as explained by the authors. The final data set contains 
9868 clones (several genes were represented by more than one clone); of these, 196 had constant values 
over all individuals. 

NCI 60 dataset From lRoss et qiJ(|2000() . The data, from cDNA a rrays, can be obtained fr om ht tp : // genome-www . steinf ord . ( 

The r aw data we used, which is the same as the d ata used in Dettline & Biihlmann ( 2003) ; Dudoit et all 
l|2002l) . is the one in the file "figure3.cdt". As in IPettlinEf fc Biihlmann (,2003) : .Dudoit et aL (,200^ we 
filtered out genes with more than two missing observations and we also eliminated, because of small sam- 
ple size, the two prostate cell line observations and the unknown observation. After filtering, we were 
left with a 61 X 5244 matrix, corresponding to eight different tumor types (note that, as d one by previ- 
ous au thors, we did not average the two observations with triplicate hybridizations). As in^y^g^^^Zj 
( 2002) we used 5-nearest neighbor imputation of missing data using the program GEPAS l^ercerc^taU . 
12003) ' (http : //gepas .bioinf o . cnio . es/cgi-bin/preprocess); unlike lPudoit et all l)2002() . however, we 
measur ed gene similarity using E uclidean distance from the genes with complete data, instead of corre- 



lation: iTrov anskava et al\ (|2001[1 found Euclidean distance to be an appropriate metric. Finally, as in 
I Dudoit et o.l 2002i p. 82) gene expression data were standardized so that arrays had mean and vari- 
ance 1 across variables (genes). 

Breast cancer dataset From Ivan Belld ()2002l) . The data, from Affymetrix arrays, were downloaded from 
http : //www. rii . com/publications/2002/vantveer .htm (we used the files ArrayData^less_than_5yr.zip, 
ArrayData^greater_than_5yr.zip, ArrayData^BRCAl.zip, corresponding to 34 patients that developed dis- 
tant metastases within 5 years, 44 that remained disease-free for over 5 years, and 18 with BRCAl germline 
mutations and 2 with BRCA2 mutations). As did by the authors, we selected only the genes that were 
"significantly regulated" (see their definition in the paper and supplementary material), which resulted in 
a total of 4869 clones. Because of the small sample size, we excluded the 2 patients with the BRCA2 
mutation. We used 5-nearest neighbor imputation for the missing data, as for the NCI 60 data set. Fi- 
nally, we excluded from the analyses the 10th subject from the set that developed metastases in less than 
5 years (sample 54, IRI000045837, in the original data files), because it had 10896 missing values out of 
the original 24481 clones, and was an outstanding outlying point both before and after imputation. The 
breast cancer dataset was used both for two class comparison (those that developed metastases within 5 
years vs. those that remain metastases free after 5 years) and for three group comparisons. 

Therefore, we use three datasets in which the problem is classification into two classes (leukemia, adenocar- 
cinoma, breast cancer), one dataset with a three class problem (breast cancer) and one dataset with an eight 
class problem. 
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3.2 The competing methods 

We have used t hree methods that have shown good perfo rmance in reviews of classification methods with 
microarray data l|Dudoit et alX 120021 iRomuald r^ l200^. 

Diagonal Linear Discriminant Analysis (DLDA) DLDA is the maximum likelihood discriminant rule, for 
multivariate normal class densities, when the class densities have the same diagonal variance-covariance 
matrix (i.e., variables are uncorrelated, and for each variable, its variance is the same in all classes). This 
yields a simple linear rule, where a sample is assigned to the class k which minimizes E^^-^(xj — x^j)'^ /^j, 
where p is the number of variables, Xj is the value on variable (gene) j of the test sample, Xkj is the 
sample mean of clas s k and variable (gene) j, and (t| is the (pooled) estimate of the variance of gene j 
l)Diidoit et oil l2002l) . In spite of its simplicity and its somewhat unrealistic assumptions (independent 
multivariate normal class densities), this method has been found to work very well. 

K nearest neighbor (KNN) KNN is a non-parametric classification method t hat predicts the sarnple of a 
test c ase as the majority vote among the k near est neighbors of the test case l|Hastie et a^.l . 12001 tHRipfevl 
To decide on "nearest" here we use, as in iDudoit et ai\ tOO'^. the Euclidean distance. The number 
of neighbors used (k) is chosen by cross-validation as in lDudoit et al\ l|2002j) : for a given training set, the 
performance of the KNN for values of k in {1, 3, 5, ... , 21} is determined by cross-validation, and the k 
that produces the smallest error is used. 

Support Vector Machin es (SVM) SVM are becoming increasingly popular classifier s in many areas, in- 
cluding microarrays l|Furev et all 1200(1: iLee fc Leel l2003t lR,amaswamv et all l200ll) . SVM (with hnear 
kernel, as used here) try to find an optimal separating hyperplane between the classes. When the classes 
are linearly separable, the hyperplane is located so that it has maximal margin (i.e., so that there is 
maximal distance between the hyperplane and the nearest point of any of the classes) which should lead 
to better performance on data not yet seen by the SVM. When the data are not separable, there is no 
separating hyperplane; in this case, we still try to maximize the margin but allow some classification errors 
subject to the constraint that the total error (distance from the hyperplane in the "wrong side") is less 
than a constant. For problems involving more than two classes there are s everal possible appro aches: the 
one used here is the "one-against-one" appr oach, as implemented in libsv r n IChang fc LinI l)2003() . Reviews 
and introductions to SVM can be found in iBureuesI jl99S|) : iHastie et al\ ll200i;|) 

For each of these t hree methods we nee d to decide which of the genes will be used to build the predictor. 
Based on the results of IPudoit et al\ ll2002fl we hav e used the 200 genes with the largest F-ratio of between to 
within groups sums of sauares. iDudoit et all l|2002|) found that, for the methods they considered, 200 genes as 
predictors tended to perform as well as, or better than, smaller numbers (30, 40, 50 depending on data set). 

We evaluated predictive performance using 10-fold cross-validation; the result s shown are from 20 replications 
of the 10-fold cv p rocess. In all cases, cross-validation includes gene selection l|Ambroise fc McLachlanL l2002t 
ISimon et oi.L 12003V in other words, for the three competing methods and the signature algorithm the selection 
of genes is carried out within each of the 10 "training sets" of the cross-validation. Thus, we insure that the 
subjects for which prediction is performed have not been used for the gene selection process. 

4 Results 

Predictive performance is shown in Figure |3I Predictive performance changes very little from using DLDA vs. 
NN, or setting ci = C2 = 1 vs. ci = C2 = 0; figures for four combinations of classifier and values of ci,C2 are 
shown in the supplementary material. Comparing ci = C2 = 1 with ci = C2 = (see Figures in supplementary 
material) does not show any relevant differences in predictive performance; of course, there are differences in the 
outcome because, not surprisingly, using ci = C2 = tends to result in more signature components of smaller 
numbers of genes per component, and higher correlations between components. 

[Figure 3 about here] 



[Table 1 about here] 

With respect to rmin, except for the NCI data set, and slightly for the Breast cancer with 3 classes data 
set, changes in rmin have little effects on predictive performance. In Table ^ we show the median number of 
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components, median total number of genes in a signature, and median average number of genes per component 
obtained in 200 bootstrap runs using ci = C2 = 1 with r,„i„ = 0.85 and r„ii„ = 0.6, and using NN as the 
classifier (see also section E!T|l . It can be seen that changes in r„ij„ do affect the outcome in terms of number of 
number of genes per component. These results seem to indicate that choice of rmin can probably be guided more 
by interpretability concerns (whether we want larger signature components of looser coexpression or smaller 
signature components of tight coexpression) than by concerns over predictive ability. 

Finally, the performance of the signature method is only slightly worse than that of the three competing 
classifiers, except for the NCI data set. As seen in Tabled most of the signatures, specially with rmin = 0.85, 
used very few components of very few genes each (compared to the use of 200 genes for the competing classifiers) ; 
thus, the predictive performance is achieved using a very small number of genes and thus potentially facilitating 
a simple biological interpretation. In the case of the NCI data set, there are eight classes with only 61 samples, 
and in most cases the signature component only returned between 1 and 3 components. Probably the forward 
sequential addition of components in the signature method has affected negatively the predictive capabilities, 
because any single addition was most likely incapable of resulting in a large enough decrease of prediction error 
to justify further addition of components (in contrast to using, directly, 200 genes as predictors). 

4.1 Stability of results 

To evaluate the stability of results, we rerun the complete procedure on all data sets using the bootstrap 
I Davison fc Hinklevlll997tlEfron fc Tibshiranilll993ll with 200 bootstrap samples, similar to what lEfron fc Gond 
1 19831) do to evaluate a complex fitting procedure. We run the procedure for settings of ci = C2 = 1 with 
rmin = 0.85 and rmin — 0.6, and using NN as the classifier. The results are shown in Table |21 The bootstrap 
results indicate that, when using a high value of rmm we rarely obtain similar solutions repeatedly; some data 
sets, however, seem to yield more stable solutions (e.g.. Leukemia data sets) and, not surprisingly, if the Vmin 
criterion is set to less stringent values, results tend to be more repeatable. 

[Table 2 about here] 



5 Discussion 

5.1 Similarities and differences with other methods 

Our method is unique because it simultaneously searches for sets of genes that arc tightly coexpressed and will 
lead to good predictive performance. The search for the sets of genes is carried out using the information from 
the dependent variable (at the first stage — when selecting the seed gene — , and at the pruning stage of reducing 
the signature component — when genes that lead to decreased predictive performance are eliminated from the 
signature component — ; see figure EJ. 

One important difference between our proposed method and most previous approaches that use PCA is that, 
by performing PCA only on subsets of genes, our method returns signature components where genes show tight 
coexpression. Because returned components are not orthogonal, and simple components are an explicit goal, 
our approach is actually closer to some ideas implemented in SAS's PROC VARCLUS, which is similar to factor 
analysis with oblique rotation and ca n be used to o b tain clusters of varia bles, of re latively simple interpretation, 
to be further used in model building iNelsm] l)200l|) : ISAS Insitutd l)l999j) : see also iHarrelll l|200l|) . 

Using PCA on subsets of genes, instead of the complete set of genes is crucial because it makes in t erpre- 
tation easier and allows for subsets of tight coexpression. Simple PCA and related methods jjolliffd . l2002t 
iRousson fc GasseIll2003^IVinesLl200n^ as well as SAS PROC VARCLUS also try to achieve components of tight 
coexpression, but many of these approaches cannot be applied with p ^ n, and all of them carry out the PCA 
without using the information from the dependent variable, which in our case is a fundamental requirement since 
the sets of genes with tightest coexpression would be irrelevant for our purposes if they are not related to the 
dependent variable we are trying to model. This difference in objectives is also evident because our aim is not 
to explain the most variance in the genes (as in most simple PCA approaches) nor maximize variable explained 
across all clusters (such as in PROC VARCLUS). Overall summarization of information is not important for our 
problem, because we are interested in prediction, and we often have the suspicion that most of the genes in the 
array are not related to the outcome variable. Finally, our approach can result in signature components which 
are correlated (sometimes strongly), but this is not inherently a problem because there is no biological reason 
to suggest that the underlying biological causes or factors ought to be independent or uncorrelated; moreover, 
if the true underlying causes are not orthogonal, using a method such as PCA can lead to interpretational 
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and conceptual diffi culties because each biological cause will be spread over several orthogonal components 
i Houle et a/'ll2002|) : non-orthogonal biological causes are inconsistent w ith procedure s such a s PCA and PLS. 

Ba yesian classification trees using the 1st PC from gene clusters l|Huang et all l2003al) and block PCA 
l|Liu et al... .2Q02) also used PCA on subsets of genes, instead of the complete set of genes. In both cases their 
subsets of genes were o btained using criteria that did not make any use of the information from the dependent 
variable. In addition, in H uang et a the metagenes are not necessarily of subsets of tightly coexpressed 

genes. In Liu et al. ( 2002) there is an explicit criterion of % variance accounted for, but often the number of 
components used to summarize a subset of genes is too large to allow for easy interpretation (11 to 16 principal 
components). 

Supervised harvesting of expression genes (|Hastie et oil 1200101) also works with clusters of subsets of genes, 
which are then used in a predictive model. As before, however, the clustering is carried out without using 
information from the dependent variable; even if the selection of which subsets or clusters to use in the model 
uses the information from the dependent variable, the very first step of clustering genes does not, and can 
therefore be unable to recover sets of tightly coexpressed genes that are good predictors. Finally, the "Wilma" 
and "Pelora" methods l|Detthng fc BiihlmannL I2OO2I do use the information from the dependent variable 

in the formation of clusters of genes; however, there is no explicit objective of achieving tight gene coexpression 
within clusters and thus, how tightly coexpressed genes are in each metagene cannot be specified in advance; in 
addition, Wilma weights each gene equally within a cluster (only possible weights are +1 and -1) whereas we 
use PCA on unsealed genes, thus allowing genes to play a different role in the specification of the direction of 
the signature component (genes with larger among-subject variance play a more important role in determining 
direction). 

The rest of the alternative methods differ strongly from our proposed method, either because they do not 
return subsets of genes but components with loadings from all genes (e.g., PLS based metho ds), or return 
subse ts of genes where there is not requirement of tight coexpression (e.g., weighted gene voting. IColub e^'oll . 

EM). 

After gene selection and dimension reduction (i.e., the use of only the 1st PC, that collapses all the informa- 
tion from the genes of a signature component onto one dimension), the predictive model of our choice is fitted. 
In this sense, the method as presented here is "just" a DLDA or NN that uses signature components instead of 
genes as the predictors. The choice of DLDA and NN was made based on publis hed results that showed their 
excellent performance with microarray data. In particular, iDudoit et al\ l|2002|) showed that other forms of 
discriminant analysis tended to perform much worse because of the small ratio sample size/parameters needed 
to estimate covariances and different variances per group; however, since in most cases our method returns just 
a few signature components, other types of discriminant analysis that use the information from the covariance 
of the predictors (e.g., linear discriminant analysis and quadratic discriminant analysis) might prove useful. 



5.2 Coexpression: across-group and within-group 

The algorithm can include in a signature component genes that show no correlation within groups but that 
show correlation among groups because they are far apart in the multidimensional space, and the correlation 
coefficient is computed across the whole, pooled, sample. The algorithm might even include in the same signature 
component genes that have very different patterns of correlation in different groups, if they still show sufficiently 
strong correlation over the pooled sample. 

To our knowledge, this issue has not been explicitly addressed in any other approach to the signature 
problem (see reviews in supplementary material). However, probably the most biologically relevant components 
are those where there is strong correlation within groups, because this is a more reliable indicator of coordinated 
expression. 

A possible solution is, for example, to only accept results for a signature component if a principal component 
analysis over the pooled sample after centering the data with respect to the group means yields a relevant first 
eigenvalue; for added robustness, we might want to use the trimmed mean. This approach, however, does not 
directly address if there are different multivariate orientations in different groups of subjects, and how these 
orientations within-group relate to the across-group orientation. In particular, the case where several groups 
not only h ave the same first principal compon ent, but are lined up along a common axis, known as "allometric 
extension" ijBartoletti et a/.Ul9 99: Hills, 1982), might con stitute the most na tural type of signature component. 

Krzanowski (see reviews and summary in .Jolliffe„ ,2002i:iKrzanowski[ll998i) has proposed a method to directly 
compare the subspaces defined by the principal components of each of the groups. In our case, as we only use 
the first principal component of a set of genes to define a signature component, for each signature component 
we can compare the first principal component of each group. An example using the NCI 60 data set is shown 
in the supplementary material. This method only compares the orientation of the princi pal components (the 
eigenvectors) but does not compare the location of the multivariate means. The EDDA l|Bensmail fc CeleuxL 
Il99fi.l and common principal components l|Fhirvl Il988t) approaches provide frameworks to examine differences 
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among covariance matrices of particular interest in the context of discrimination among groups. Nevertheless, 
it must be remembered that biological interpretati on of results is not always straightforward, specially when 
the underlying biological factors arc not orthogonal llHoule et QiJ . l2002l) . and that we need to asses the need for 
robust methods ( Bocnte et gi,, .2002) and power issues related to the small sample sizes common in microarray 
data. We are currently investigating some of these issues; however, in the presence of the difficulties associated 
with interpretation, low statistical power, and possible lack of robustness to outlying observations, right now 
the best recommendation could be to conduct an examination of patterns of within vs. among coexpression for 
the returned signatures. 



5.3 Stability and biological relevance of signatures 

Our method follows from an operationalization of signatures and signature components (figure [T]l. If the ideas 
embodied in figure ^ have empirical support, then it might be possible to build good predictive models using 
just a few, very specific biological features that can be related to alterations on particular pathway. Our method 
is unique in this regard because it returns only a few signature components where genes within each component 
are tightly coexpressed. 

On the other hand, our method can indicate that the data are inconsistent with the ideas behind figure ^ if 
the signature components show high instability, this is evidence that very different models can be obtained from 
the data. Widely different models from a reasonably large data set cast doubt on the idea that a few, easily 
interpretable, signature components strongly correlated with the expression of a few key genes a re associated to 
the cli nical out come of interest. In the context of building predictors from gene expression data, ISomoriai et al\ 
l|2003l) (see also lZhang erall . 12003^ have emphasized that this non-uniqueness leads to interpretational difficulties 
and should make researchers skeptical about the biological relevance of any set of predictors; moreover, they 
explain how this non-uniqueness can arise from dataset sparsity. Of course, both of these issues are relevant to 
the present proposed methodology. None of the data sets examined in this paper yield stable signatures when we 
use stringent criteria of gene coexpression f see 14. II and Table|5|with rmin = 0.85), although some of the data sets 
are somewhat stable from run to run when the Vmin is set to small values (but other data sets show signatures 
that vary widely from run to run). These results add to the above references in the sense that biological 
interpretation should be carried out very cautiously, an d emphasize th e difference between attempting to build 
good predictors and attempting interpretation (see also iBreimani l2f)f)l|) . More relevant to the current work, the 
present results indicate that simple models of molecular signatures warrant further critical scrutiny, and t hat it 
might be extremely hard to identify molecular signatures from such sparse data sets (see iRhodes et all . l2004l 
for a meta-analysis attempt to identify stable "signatures"). 

We must recognize that these results are preliminary for two main reasons. First, establishing that two or 
more signature components are different probably requires additional information besides the identities of the 
genes; for instance, information from Gene Ontology, or known participation on certain regulatory networks. 
Second, we have used two different rmin thresholds, but it is unclear what constitute "biologically reasonable" 
patterns of covariation between genes that are to belong to the same signature component. Nevertheless, even if 
preliminary, these results emphasize the need for further work in the operationalization and explicit definition of 
what we mean by molecular signatures, careful consideration of the stability of results, and critical assessment 
of the sample sizes need to reliably identify molecular signatures. 



5.4 Alternative statistical methods for alternative biological models 

As mentioned in the introduction, we started by trying to clarify, conceptually, what is often understood 
by molecular signature (see Figure^, and then devised a statistical method to fulfill those requirements. 
The biological model underlying the suggested method is one where most of the genes are not relevant for 
prediction, relevant genes are involved in one and only one signature component (i.e., non-overlapping signature 
components), and the signature components are common, and have similar covariance matrices, in different 
groups. 

However, other biological models are plausible, and for those biological models ot her statistical methods 
would be more appropriate. The simultaneous clustering and classification approach in lJornsten fc Yul l)200,'j) 
could be extended by placing restrictions on the covariance matrix (i.e., require a minimum correlation between 
genes) but possibly allowing for different covariance matrices among groups; thus, we could address directly 
issues of different across vs. within-group correlations (see section l5.2|) . within a formal inferential framework. 
Biological models where signature components are not common and/or do not behave similarly in different 
groups could be in vestigated using modifications of the Plaid model of iLazzeroni fc Ow en (2002) (see also 
iTurner et all 120(3) . In addition, genes with the highest correlation need not be the best candidates for being 
in the same biological pathway; activity in a pathway might just require that precursor genes get activated, 
but once a threshold is reached, it might not be very important by how much the threshold is exceeded. This 
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type of behavior could preclude strong correlation between genes that belong to the same pathway. This can 
be modelled building upon the latent class m ethods of Parmigiani and colleagues ijGarrett fc Parmigia ni. 200^ 
iParm igiani etoL l l2002HScharDf et QilEool . where signature components are based on under-, over- or baseline 
expression (instead of expression levels). Work along these lines is currently in progress in our group. 

5.5 "Just" dimensional reduction? 

Even if the current method fails to identify stable features that can be associated to molecular signatures, it can 
be a useful dimension reduction tool. Difficulties associated with a simple mapping of the returned "signature 
components" to pathways, and problems derived from instability of the found components also affect any other 
of the existing alternative methods. Thus, in the presence of instability of results, it is more appropriate 
to regard this method as dimension reduction tool that could lead to simple biological interpretation. The 
simple biological interpretation could be helped not only because of the coexpression of the genes that make 
a signature component, but also because the dimension reduction performed is quite remarkable compared to 
other methods (the number of signature components and genes returned is very small for the five data sets 
examined; see Results) with, at most, only a slight decrease in predictive performance. Moreover, the user 
can control the relative trade-offs between predictive performance and potential interpretability of results (e.g., 
coexpression of sets of genes) by changing the rmin parameter (note that if rmin = 1 the method becomes 
essentially either DLDA or NN with forward addition of genes to the model). This flexible modification of 
the trade-offs between prediction error and interpretability is of great importance in methods that are largely 
exploratory and oriented towards providing "biologically interpretable" output; in other words, methods for 
which minimization of prediction error should not be the only goal. 

6 Conclusions 

The most common methods for finding signatures present several deficiencies that do not allow them to return 
signatures and signature components that fulfill basic biological requirements. After suggesting an operational 
definition of signature and signature components, we have developed a method that follows directly from what 
are often considered as the biologically relevant signature characteristics. The method developed returns signa- 
ture components of tightly coexpressed genes and thus can facilitate biological interpretation. In this paper we 
have applied the method to classification problems, but this approach in fact sets up a framework that allows us 
to find signatures regardless of the type of dependent variable. Extension to use other classifiers is straightfor- 
ward and it should also be easy to incorporate other types of dependent variables to allow, for example, survival 
analysis. We have also shown that the predictive performance of our method is comparable to that of state of 
the art methods. Finally, our method not only could facilitate mapping pathological alterations to a few, tightly 
coexpressed sets of genes, but can also provide evidence that the underlying biological assumptions behind this 
attempt are inconsistent with the data. In the five data sets analyzed, our results suggest that identification of 
molecular signatures is questionable under this simple model. These results emphasize the needed for further 
work on the operationalization of the biological model and the necessity of critical assessment of the stability 
of putative signatures. 
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Table 1: Median values from 200 bootstrap runs for total number of genes in signatures, number of signature 
components and average number of genes per component. 
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Table 2: Stability of results using the bootstrap, with 200 bootstrap iterations. Values shown are the number 
of genes that are returned, as members of a signature component, in at least those many bootstrap runs. 
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1. Genes of a signature component should show tight coex- 
pression. We can make this more explicit by requiring that 

each gene of a signature component should show a strong 
correlation with the signature component. 

2. For a given classification/prediction problem only a few sig- 
nature components should be needed to obtain reasonable 
predictive performance. 

3. Signature components could have many genes; addition- 
ally, it often seems more desirable to include a gene in a 
signature component even if it does not belong to that sig- 
nature, than to exclude a gene that does belong to that 
signature. 

4. The same genes are used for a signature component over 
all samples (i.e., the signature components are the same for 
all groups). 

Figure 1: Requirements of signatures and signature components (see text for details). 
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1. Find the seed gene for a signature component: 

(a) Seed gene is gene with smahest cross-vahdated (CV) predic- 
tion error among available genes. (The CV prediction error is 
obtained using as predictive model the chosen predictor [e.g., 
DLDA], including all previous signatures, if any). 

(b) If CV prediction error < (CV prediction error of the previous 
signature - Ci standard error), continue; otherwise, terminate 
signature finding. 

2. If signature component = 1, eliminate all genes with (resubstitu- 
tion) prediction error > prediction error from always betting on 
the most frequent class. 

3. Build an initial signature with all the genes j where 
ab s {cor {gencj, seed. gene)) > Vseed- 

4. Obtain the signature component as the 1st PC of a PCA on the 
initial signature. 

5. Reduce signature component: 

(a) Eliminating, one by one, from the signature the gene with 
the smallest absolute correlation with the seed gene, until 
abs{correlation{xpr-j,pr^)) > rmin is met. 

(b) Eliminate, one by one, any gene for which its exclusion from 
the signature component leads to a CV prediction error < last 
prediction error - C2 s.e. (prediction error). 

6. Exclude from further consideration all genes that belong to the 
signature component just build. 

7. Return to 1. until no further components are needed. 

Figure 2: Basic steps of the signature algorithm. 
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Figure 3: Predictive performance, as a function of rmm, of the signature method using NN as classifier and 
comparison with SVM, KNN, and DLDA. Figures based on 20 rephcates of the 10-fold-CV procedure. Results 
for ci = C2 = 1. 
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